Deep learning based on ultrasound images assists breast lesion diagnosis in China: a multicenter diagnostic study

Background Studies on deep learning (DL)-based models in breast ultrasound (US) remain at the early stage due to a lack of large datasets for training and independent test sets for verification. We aimed to develop a DL model for differentiating benign from malignant breast lesions on US using a large multicenter dataset and explore the model’s ability to assist the radiologists.
Methods A total of 14,043 US images from 5012 women were prospectively collected from 32 hospitals. To develop the DL model, the patients from 30 hospitals were randomly divided into a training cohort (n = 4149) and an internal test cohort (n = 466). The remaining 2 hospitals (n = 397) were used as the external test cohorts (ETC). We compared the model with the prospective Breast Imaging Reporting and Data System assessment and five radiologists. We also explored the model’s ability to assist the radiologists using two different methods.
Results The model demonstrated excellent diagnostic performance with the ETC, with a high area under the receiver operating characteristic curve (AUC, 0.913), sensitivity (88.84%), specificity (83.77%), and accuracy (86.40%). In the comparison set, the AUC was similar to that of the expert (p = 0.5629) and one experienced radiologist (p = 0.2112) and significantly higher than that of three inexperienced radiologists (p < 0.01). After model assistance, the accuracies and specificities of the radiologists were substantially improved without loss in sensitivities.
Conclusions The DL model yielded satisfactory predictions in distinguishing benign from malignant breast lesions. The model showed the potential value in improving the diagnosis of breast lesions by radiologists.
Supplementary Information The online version contains supplementary material available at 10.1186/s13244-022-01259-8.


Visualization of the model
The Class Activation Mapping (CAM) algorithm [3] was employed to visualize how our model interpreted the breast US images for predicting the risk of breast lesions. This algorithm applies color mapping to images according to the weight of the feature extraction layer of the model, and creates heatmaps in which the color is warmer in areas of strong attention and colder in areas of weak attention (Supplementary Figure 4).
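As an illustration of the weighting step described above, the following is a minimal sketch of how a CAM heatmap can be computed. It assumes the standard CAM formulation (a final convolutional layer followed by global average pooling and a linear classifier); the architecture actually used in this study is not specified here, and the names `class_activation_map`, `feature_maps`, and `class_weights` are hypothetical.

```python
import numpy as np


def class_activation_map(feature_maps, class_weights):
    """Compute a normalized CAM heatmap for one class.

    feature_maps:  array of shape (K, H, W), activations of the last
                   convolutional (feature extraction) layer.
    class_weights: array of shape (K,), weights connecting each pooled
                   feature map to the target class (assumed layout).
    Returns an (H, W) array scaled to [0, 1]; higher values correspond to
    "warmer" areas of stronger attention.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    class_weights = np.asarray(class_weights, dtype=float)

    # Weighted sum of the K feature maps: sum_k w_k * f_k(x, y)
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))

    # Shift and scale to [0, 1] so the map can be passed to a color map
    cam = cam - cam.min()
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam


# Toy example with two 2x2 feature maps (illustrative values only)
toy_maps = np.zeros((2, 2, 2))
toy_maps[0, 0, 0] = 1.0
toy_maps[1, 1, 1] = 1.0
toy_cam = class_activation_map(toy_maps, [2.0, 1.0])
```

In practice the resulting map would be upsampled to the input image size and overlaid on the US image with a color map; that step is omitted here.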

Comparison of the diagnostic performance between the model and radiologists
The comparison set comprised 201 patients randomly selected from the test sets: 101 patients (39 malignant and 62 benign breast lesions) with 307 images from the internal test set, and 100 patients (49 malignant and 51 benign breast lesions) with 349 images from the external test sets. The diagnostic performance of the model and the radiologists in the comparison set is shown in Supplementary Table 3 for the internal test set and in Supplementary Table 4 for the external test sets.

Data and statistical analysis
Intra-observer variability was assessed with kappa statistics by comparing each radiologist's assessment in the first reading alone with that radiologist's assessment in the second reading before the DL model prediction was shown. The level of agreement (kappa value, κ) was defined as follows: ≤ 0.20, slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; and 0.81-1.00, almost perfect agreement [4][5]. The intra-observer agreement in BI-RADS assessment was perfect for radiologist 1, substantial for radiologist 2, and moderate for radiologists 3, 4, and 5 (Supplementary Table 9).
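The agreement calculation can be sketched as follows. This is a minimal example assuming unweighted Cohen's kappa; the text does not state whether a weighted variant was used, and the function names and the toy BI-RADS readings are hypothetical.

```python
from collections import Counter


def cohen_kappa(first_reading, second_reading):
    """Unweighted Cohen's kappa between two readings of the same lesions."""
    n = len(first_reading)
    if n == 0 or n != len(second_reading):
        raise ValueError("readings must be non-empty and of equal length")

    # Observed proportion of lesions given the same category in both readings
    observed = sum(a == b for a, b in zip(first_reading, second_reading)) / n

    # Agreement expected by chance, from the marginal category frequencies
    counts_1 = Counter(first_reading)
    counts_2 = Counter(second_reading)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)

    if expected == 1.0:  # both readings use one identical category throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_level(kappa):
    """Map a kappa value to the agreement categories defined in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy example: one radiologist's BI-RADS categories for four lesions
reading_1 = ["3", "4a", "4a", "5"]
reading_2 = ["3", "4a", "4b", "5"]
kappa = cohen_kappa(reading_1, reading_2)
level = agreement_level(kappa)
```

Because BI-RADS categories are ordered, a weighted kappa that penalizes near misses less than distant disagreements is a common alternative; the sketch above treats every disagreement equally.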
Decision curve analysis (DCA) was used to assess the clinical usefulness of the DL model and of the radiologists with and without model assistance in breast cancer prediction. DCA evaluates the clinical utility of a prediction model by quantifying the net benefit at different threshold probabilities [6][7][8]. In the decision curves (Supplementary Figure 6), the y-axis measures the net benefit and the x-axis shows the corresponding risk threshold. The grey line represents the assumption that all lesions were malignant, and the black straight line represents the assumption that all lesions were benign.
Statistical analyses were performed by using SPSS software (version 25.0 for Windows; IBM Corporation, Armonk, NY), MedCalc Statistical Software (version 18.2.1, MedCalc Software bvba, Ostend, Belgium), and R language (version 3.6.2). A P value < 0.05 was considered statistically significant.
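To make the net benefit in DCA concrete, the sketch below uses the commonly cited definition, net benefit = TP/n - (FP/n) × pt/(1 - pt), where pt is the risk threshold. This is an assumption about the formulation; the study's analyses were run in the statistical packages listed above, and the function names and toy data here are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of calling a lesion malignant when y_prob >= threshold.

    y_true: 1 for malignant, 0 for benign.
    y_prob: predicted probability of malignancy for each lesion.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_all_malignant(y_true, threshold):
    """Reference curve: every lesion is assumed malignant."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def net_benefit_all_benign():
    """Reference line: every lesion is assumed benign, so net benefit is 0."""
    return 0.0


# Toy example (illustrative values only, not study data)
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.1]
nb_model = net_benefit(labels, probabilities, 0.2)
nb_all = net_benefit_all_malignant(labels, 0.2)
```

A decision curve is obtained by evaluating these functions over a range of thresholds and plotting net benefit against the threshold.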

Breast Ultrasound Images Dataset
In addition, we evaluated our DL model on the publicly available Breast Ultrasound Images (BUSI) dataset [9]. The BUSI dataset contains 437 benign images, 210 malignant images, and 133 normal images. Normal images were excluded from the study, as were images with marks (caliper measurement, vascularity, and text) blocking the mass; only images containing a single mass were included. A total of 504 images (504 masses) were used in this study.
The images of the lesion on B-mode US (a, c, e & g, i, k) and heatmaps (b, d, f & h, j, l). (Left) A 51-year-old woman with a palpable breast mass, which was confirmed as invasive ductal carcinoma. This lesion was classified as Breast Imaging-Reporting and Data System (BI-RADS) category 4c, 4c, 5, 5, and 5 by the five radiologists, respectively. The DL model output predicted probability scores of 0.999 and 0.001 for malignant and benign lesion, respectively. The binary prediction result was malignant. The DL model provided a strong suggestion of malignancy to the radiologists as decision support and enhanced the confidence of the radiologists' diagnoses. (Right) A 35-year-old woman with an asymptomatic breast mass, which was confirmed as fibroadenoma. This lesion was classified as BI-RADS category 3, 4a, 4a, 4a, and 4a by the five radiologists, respectively. The DL model output predicted probability scores of 0.016 and 0.984 for malignant and benign lesion, respectively. The binary prediction result was benign. The breast lesion was downgraded to category 3 by using method one for one experienced and three inexperienced radiologists.
Supplementary Figure 6 Decision curves for breast cancer prediction by the DL model and by radiologists with and without model assistance in the comparison set.
The decision curves indicate that using the DL model (red line) to predict breast cancer adds more benefit to patients than the radiologists alone (green line) in the comparison set. The radiologists with model assistance using method two (purple line) bring more benefit than the radiologists with model assistance using method one (blue line) in the comparison set. AI = artificial intelligence, R = radiologists without model assistance, RM1 = radiologists with model assistance method one, RM2 = radiologists with model assistance method two.